How Predictable Are Biological Sequences?

Authors

  • Izydor Apostol
  • Philippe Jacquet
  • Wojciech Szpankowski
Abstract

A major challenge facing computational biology is the post-sequencing analysis of patterns and motifs in genomic DNA sequences. In this study, we apply the asymptotically optimal Sampled Pattern Matching (SPM) predictor, which we recently developed, to analyze biological sequences (i.e., proteins and DNA). For a given sequence $X_1, \ldots, X_n$, the SPM predictor predicts the next symbol $X_{n+1}$ (or the next $K$ symbols) by selecting a context of $X_{n+1}$; that is, it predicts the most frequent symbol appearing at the so-called sampled positions. These positions follow the occurrences of a fraction of the longest suffix of the original sequence that has another copy inside $X_1 X_2 \ldots X_n$. In our previous paper [11] we estimated the redundancy of the SPM universal predictor; that is, we established that the probability that the SPM predictor makes worse decisions than the optimal predictor is $O(n^{-\varepsilon})$ for some $0 < \varepsilon < \tfrac{1}{2}$ as $n \to \infty$. Applied to molecular sequences, SPM proves suitable for both protein and DNA prediction. Finally, we should add that our ultimate goal is to bring solid methods of information theory to the analysis of molecular sequences. This paper is our preliminary attempt.
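
The prediction rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `alpha` parameter (the fraction of the longest repeated suffix that is kept), the rounding with `ceil`, the tie-breaking, and the fallback used when no suffix repeats are all assumptions made for illustration. The search is a naive scan, chosen for readability rather than efficiency.

```python
import math
from collections import Counter


def longest_repeated_suffix_len(seq: str) -> int:
    """Length of the longest suffix of seq that has another copy inside seq."""
    n = len(seq)
    best = 0
    for k in range(1, n):
        suffix = seq[n - k:]
        # Restrict the search to seq[:n-1] so the suffix cannot match itself.
        if seq.find(suffix, 0, n - 1) != -1:
            best = k
        else:
            break  # if a suffix has no other copy, no longer suffix has one
    return best


def spm_predict(seq: str, alpha: float = 0.5):
    """Predict the next symbol of seq (hypothetical sketch of the SPM idea)."""
    n = len(seq)
    longest = longest_repeated_suffix_len(seq)
    k = math.ceil(alpha * longest)  # assumed rounding of the kept fraction
    if k == 0:
        # Assumed fallback: no repeated context, use the overall majority symbol.
        return Counter(seq).most_common(1)[0][0] if seq else None
    marker = seq[n - k:]
    counts = Counter()
    start = seq.find(marker)
    while start != -1:
        follow = start + k  # sampled position: symbol right after the marker
        if follow < n:
            counts[seq[follow]] += 1
        start = seq.find(marker, start + 1)  # overlapping occurrences allowed
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(spm_predict("ACGTACGTACG", alpha=0.5))
```

For the periodic toy sequence `ACGTACGTACG`, the longest repeated suffix is `ACGTACG`; with `alpha=0.5` the kept marker is `TACG`, and the only sampled position holds `T`, so the sketch predicts `T`.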

Similar resources

A computational method to analyze the similarity of biological sequences under uncertainty

In this paper, we propose a new method to analyze the difference and similarity of biological sequences, based on the fuzzy sets theory. Considering the sequence order and some chemical and structural properties, we present a computational method to cluster the biological sequences. By some examples, we show that the new method is relatively easy and we are able to compare the sequences of arbi...

Mining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM

Structural repetitive subsequences are the most important portion of biological sequences and play crucial roles in the corresponding sequence’s fold and functionality. The biggest class of repetitive subsequences is “Transposable Elements”, which has its own sub-classes depending on the contexts’ structures. Many studies have been performed to critically determine the structure and function of repetitiv...

Relation Between RNA Sequences, Structures, and Shapes via Variation Networks

Background: RNA plays a key role in many aspects of biological processes, and its tertiary structure is critical for its biological function. RNA secondary structure represents various significant portions of RNA tertiary structure. Since the biological function of RNA is inferred indirectly from its primary structure, it would be important to analyze the relations between the RNA sequences and t...

Clustering of Short Read Sequences for de novo Transcriptome Assembly

Given the importance of transcriptome analysis in various biological studies and considering the vast amount of whole transcriptome sequencing data, it seems necessary to develop an algorithm to assemble transcriptome data. In this study we propose an algorithm for transcriptome assembly in the absence of a reference genome. First, the contiguous sequences are generated using de Bruijn graph with d...

The Phylogeny of Calligonum and Pteropyrum (Polygonaceae) Based on Nuclear Ribosomal DNA ITS and Chloroplast trnL-F Sequences

This study presents phylogenetic analyses of two woody polygonaceous genera, Calligonum and Pteropyrum, using both chloroplast fragment (trnL-F) and nuclear ribosomal internal transcribed spacer (nrDNA ITS) sequence data. All phylogenies inferred using parsimony and Bayesian methods showed that Calligonum and Pteropyrum are both monophyletic and closely related taxa. They have no affinity w...

Publication date: 2003